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Abstract 

This article exposes the failure of some big neural networks to leverage added 
capacity to reduce underfitting. Past research suggest diminishing returns when 
increasing the size of neural networks. Our experiments on ImageNet LSVRC- 
2010 show that this may be due to the fact there are highly diminishing returns for 
capacity in terms of training error, leading to underfitting. This suggests that the 
optimization method - first order gradient descent - fails at this regime. Directly 
attacking this problem, either through the optimization method or the choices of 
parametrization, may allow to improve the generalization error on large datasets, 
for which a large capacity is required. 



1 Introduction 

Deep learning and neural networks have achieved state-of-the-art results on vision Q language [^] , 
and audio-processing tasks |^| All these cases involved fairly large datasets, but in all these cases, 
even larger ones could be used. One of the major challenges remains to extend neural networks on 
a much larger scale, and with this objective in mind, this paper asks a simple question: is there an 
optimization issue that prevents efficiently training larger networks? 



Prior evidence of the failure of big networks in the litterature can be found for example in Coates 
et al. (201 1 1, which shows that increasing the capacity of certain neural net methods quickly reaches 
a point of diminishing returns on the test error. These results have since been extended to other 
types of auto-encoders and RBMs (Rifai et al. 2011 Courville et al. 201 l| l. Furthermore, ICoates 



et al.\ ( 20TT) shows that while neural net methods fail to leverage added capacity K-Means can. 



This has allowed K-Means to reach state-of-the-art performance on CIFAR-10 for methods that do 
not use artificial transformations. This is an unexpected result because K-Means is a much dumber 
unsupervised learning algorithm when compared with RBMs and regularized auto-encoders. Coates 
\et al. | ( pOTT) argues that this is mainly due to K-Means making better use of added capacity, but it 
does not explain why the neural net methods failed to do this. 



2 Experimental Setup 

We will perform experiments with the well known ImageNet LSVRC-20 10 object detection datasej^] 
The subset used in the Large Scale Visual Recognition Challenge 2010 contains 1000 object cate- 
gories and 1.2 million training images. 



recognition benchmark 



Krizhevsky etcUA I 2012 reduced by almost one half the error rate on the 1000-class ImageNet object 



^Mikolov et al. 



1 201 1 1 reduced perplexity on WSI by 40% and speech recognition absolute word error rate 



Seide et al. 



by > 1%. 

3 For speech recognition, 
datasets of 309 hours 

4 http://www.image-net.org/challenges/LSVRC/2010/ 



201 1 1 report relative word error rates decreasing by about 30% on 
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This dataset has many attractive features: 



The task is difficult enough for current algorithms that there is still room for much improve- 
ment. For instance, |Krizhevsky et a/.| ( |2012| l was able to reduce the error by half recently. 
What's more the state-of-the-art is at 15.3% error. Assuming minimal error in the human 
labelling of the dataset, it should be possible to reach errors close to 0%. 

Improvements on ImageNet are thought to be a good proxy for progress in object recogni- 
tion (jDeng£FaT]|2009]l. 



3. It has a large number of examples. This is the setting that is commonly found in industry 
where datasets reach billions of examples. Interestingly, as you increase the amount of 
data, the training error converges to the generalization error. In other words, reducing 
training error is well correlated with reducing generalization error, when large datasets are 
available. Therefore, it stands to reason that resolving underrating problems may yield 
significant improvements. 

We use the features provided by the Large Scale Visual Recognition Challenge 201 The images 
are convolved with SIFT features, then K-Means is used to form a visual vocabulary of 1000 visual 
words. Following the litterature, we report the Top-5 error rate only. 

The experiments focus on the behavior of Multi-Layer Perceptrons (MLP) as capacity is increased. 
This is done by increasing the number of hidden units in the network. The final classification layer of 
the network is a softmax over possible classes (softmax(x) = e _x /^ i e~~ Xi ). The hidden layers 
use the logistic sigmoid activation function (tr(x) = 1 / (i +e -*)). We i nitialize the weights of the 
hidden layer according to the formula proposed by |Glorot and Bengio| ( |2010] ). The parameters of 
the classification layer are initialized to 0, along with all the bias (offset) parameters of the MLP. 

The hyper-parameters to tune are the learning rate and the number of hidden units. We are in- 
terested in optimization performance so we cross-validate them based on the training error. We 
use a grid search with the learning rates taken from {0.1,0.01} and the number of hiddens from 
{1000, 2000, 5000, 7000, 10000, 15000}. When we report the performance of a network with a 
given number of units we choose the best learning rate. The learning rate is decreased by 5% every - 
time the training error goes up aftern an epoch. We do not use any regularization because it would 
typically not help to decrease the training set error. The number of epochs is set to 300 s that it is 
large enough for the networks to converge. 

The experiments are run on a cluster of Nvidia Geforce GTX 580 GPUs with the help of the Theano 
library (Bergstr a et al.\ 20101. We make use of HDF5 (Folk et al. 2011 1 to load the dataset in a 
lazy fashion because of its large size. The shortest training experiment took 10 hours to run and the 
longest took 28 hours. 



3 Experimental Results 

Figure[T]shows the evolution of the training error as the capacity is increased. The common intuition 
is that this increased capacity will help fit the training set - possibly to the detriment of generalization 
error. For this reason practitioners have focused mainly on the problem of overfitting the dataset 
when dealing with large networks - not underfitting. In fact, much research is concerned with proper 



regularization of such large networks ( Hinton et al. 2012 2006 1. 



However, Figure |2]re veals a problem in the training of big networks. This figure is the derivative of 
the curve in Figure [T] (using the number of errors instead of the percentage). It may be interpreted 
as the return on investment (ROI) for the addition of capacity. The Figure shows that the return 
on investment of additional hidden units decreases fast, where in fact we would like it to be close 
to constant. Increasing the capacity from 1000 to 2000 units, the ROI decreases by an order of 
magnitude. It is harder and harder for the model to make use of additional capacity. The red line is 
a baseline where the additional unit is used as a template matcher for one of the training errors. In 
this case, the number of errors reduced per unit is at least 1 . We see that the MLP does not manage 
to beat this baseline after 5000 units, for the given number of training iterations. 



5 http://www.image-net.org/challenges/LSVRC/2010/download-public 
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Figure 1: Training error with respect to the capacity of a 1 -layer sigmoidal neural network. This 
curve seems to suggest we are correctly leveraging added capacity. 




Figure 2: Return on investment on the addition of hidden units for a 1 -hidden layer sigmoidal neural 
network. The vertical axis is the number of training errors removed per additional hidden unit, after 
300 epochs. We see here that it is harder and harder to use added capacity. 



For reference, we also include the learning curves of the networks used for Figure [TJ and [2] in Figure 
[3] We see that the curves for capacities above 5000 all converge towards the same point. 

4 Future directions 

This rapidly decreasing return on investment for capacity in big networks seems to be a failure of 
first order gradient descent. 

In fact, we know that the first order approximation fails when there are a lot of interactions between 
hidden units. It may be that adding units increases the interactions between units and causes the 
Hessian to be ill-conditioned. This reasoning suggests two research directions: 

• methods that break interactions between large numbers of units. This helps the Hessian 
to be better conditioned and will lead to better effectiveness for first-order descent. This 
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Figure 3: Training error with respect to the number of epochs of gradient descent. Each line is a 
1-hidden layer sigmoidal neural network with a different number of hidden units. 



type of method can be implemented efficiently. Examples of this approach are sparsity and 
orthognality penalties. 

methods that model interactions between hidden units. For example, second order methods 
(Martens 2010) and natural gradient methods ( Le Roux et al. | [2008). Typically, these are 
expensive approaches and the challenge is in scaling them to large datasets, where stochas- 
tic gradient approaches may dominate. The ideal target is a stochastic natural gradient or 
stochastic second-order method. 



The optimization failure may also be due to other reasons. For example, networks with more ca- 
pacity have more local minima. Future work should investigate tests that help discriminate between 
ill-conditioning issues and local minima issues. 

Fixing this optimization problem may be the key to unlocking better performance of deep networks. 
Based on past observations ( Bengio , 2009 ; Erhan et al. 2010| l, we expect this optimization problem 
to worsen for deeper networks, and our experimental setup should be extended to measure the effect 
of depth. As we have noted earlier, improvements on the training set error should be well correlated 
with generalization for large datasets. 
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